Fast vehicle detection based on colored point cloud with bird’s eye view representation

RGB cameras and LiDAR are crucial sensors for autonomous vehicles that provide complementary information for accurate detection. Recent early-level fusion approaches, which enrich LiDAR data with camera features, may not achieve promising performance owing to the immense difference between the two modalities. This paper presents a simple and effective vehicle detection method based on an early-fusion strategy, unified 2D BEV grids, and feature fusion. The proposed method first eliminates many null points through camera-LiDAR calibration. It augments the point cloud with color information to generate a 7D colored point cloud, and unifies the augmented data into 2D BEV grids. The colored BEV maps can then be fed to any 2D convolutional network. A dedicated Feature Fusion (2F) detection module is used to extract multi-scale features from the BEV images. Experiments on the KITTI public benchmark and the nuScenes dataset show that fusing the RGB image with the point cloud, rather than using the raw point cloud alone, leads to better detection accuracy. Besides, the inference time of the proposed method reaches 0.05 s/frame thanks to its simple and compact architecture.


Related work
In 3D scene perception, one of the primary challenges is that 3D data covers a large field of view (FOV) with irregular and unorganized points, which not only requires careful consideration of the data representation, but also calls for a suitable CNN architecture. Regarding the use of 3D LiDAR data for object detection in self-driving vehicles, various methods have been developed. This section introduces two mainstream families of methods, namely LiDAR-only and multi-modal data fusion.
Projected 2D views-based methods: to decrease the computational burden of 3D data, some previous algorithms combat sparsity by projecting 3D data into 2D pseudo images. This operation yields a dense and compressed representation of the point cloud, so that mature 2D detection frameworks can be applied for prediction. VeloFCN 2 , LMNet 3 , and FVNet 4 project the point cloud as a front view (FV), while BirdNet 5 , BirdNet+ 6 , Complex-YOLO 7 , RT3D 8 , PIXOR 9 , PointPillars 10 and HDNet 11 use bird's-eye-view (BEV) formats. The front view is similar to an image and contains spatial coordinates, making it simpler to acquire location and appearance. The BEV representation has further benefits over the front view. For instance, it retains physical dimensions, directly provides the position on the ground surface, and the dimensions of an object do not change with distance. Moreover, occlusion issues are mitigated because different objects occupy distinct spatial regions.
This projection solution is intended to satisfy real-time requirements, and thus the BEV representation has dominated the projection methods. Voxel grids-based methods: voxel-based approaches 12 discretize 3D space into equal voxels and utilize standard 3D CNNs 13,14 to extract the feature of each voxel grid. Li 15 proposes a framework that extends a 2D FCN to a 3D FCN for detection. The voxel-based feature extractor (VFE) 16 increases the receptive field and adds additional context to extract features. SECOND 17 modified VoxelNet 16 by adding sparse convolution to eliminate empty voxels, which enhanced the efficiency of 3D convolution. To better discriminate occluded vehicles, SegVoxelNet 18 designs a depth-aware head with different kernel sizes and convolutional dilation rates.
Pure point cloud-based methods: PointNet 19 is a promising method for directly handling 3D data without any additional transformation or pre-processing. Building on the benefits of PointNet in terms of translation invariance, local connections, and shared parameters, several specialized variants 20,21 were subsequently proposed. However, both the computation and the memory required to process 3D models increase cubically.
Early-level fusion-based methods fuse the perception data of each modality directly through spatial alignment and projection at the data stage. For example, PointPainting 22 utilizes DeepLabV3+ to yield per-pixel labels and then projects these labels back to 3D space to obtain decorated point clouds; it does not fuse the high-level features of the different modalities at all. PointAugmenting 24 proposes to augment LiDAR points with deep features extracted from the 2D image.
In middle-level fusion-based methods [25][26][27][28][29][30][31][32][33][34][35][36][37][38][39] , the features are combined after feature extraction. MV3D 25 and AVOD 26 project the point cloud into 2D views, treat the RGB image and the 2D projected views as inputs to different backbones, and then fuse the features from multiple views to predict bounding boxes. The two implementations of BEVFusion 31,32 unify features from multi-modal inputs by projecting image features onto a shared BEV space and concatenating them with the LiDAR features. TransFusion 33 relies on LiDAR BEV features and image guidance to generate object queries, and then utilizes an attention mechanism to fuse these queries with image features. MSMDFusion 34 designs multi-modal interactions in BEV space and voxel space to align spatial features from different sensors. Liu 35 takes an RGB image and two sparse depth maps as input and designs a spatial motion perception module to generate a pseudo-LiDAR point cloud. Liang 36 presents several auxiliary tasks to assist object detection, including 2D object detection, ground estimation, and depth completion. Gu 37 proposes a cascade fusion strategy: the LiDAR data is sent to a network to acquire sparse results in the first phase, and through a sparse-to-dense module, the features of both modalities are merged in the fusion model. Some methods attempt to merge the two modalities by sharing information across 2D and 3D backbones; the core problem is feature mismatching between the RGB image and the point cloud. Wang et al. 39 propose a sparse non-homogeneous pooling in which a projection matrix is utilized to transform features between the RGB image and the BEV. ContFuse 38 utilizes a ResNet-18 to extract features from the RGB image and the point cloud BEV view separately, then interpolates RGB features for each BEV pixel position based on a K-nearest neighbor search, and finally employs a parametric continuous convolution network to project them onto the BEV plane and merge them with the BEV features. To utilize continuous convolutions, the K nearest 3D points must be identified for each grid cell, a computationally expensive operation that may not meet real-time requirements as the point density grows.
Late-level fusion-based methods incorporate 2D and 3D results from independent networks based on their interrelationships or dedicated models. F-PointNet 40 and F-ConvNet 41 utilize 2D detection to yield 2D proposals, which are re-projected into 3D space and then fed into a PointNet-like network to predict the corresponding 3D bounding boxes. Based on this methodology, the framework in 42 applies YOLOv3 to obtain proposals, which are then cast onto frustums to yield highly accurate trajectories.
In contrast to existing methods that either utilize a complex pipeline to process different modalities or conduct late fusion, our simple yet effective fusion strategy allows the interaction between modalities to be learned at an early stage with a less computation-heavy network. We demonstrate that this approach is effective and achieves better detection results for distant and occluded objects, which significantly improves detection performance.

Method
In this section, Fig. 1 gives an overview of the framework for 7D colored point cloud generation and the proposed vehicle detection approach. It has two input modalities: RGB images taken by the camera and sparse point clouds captured by a Velodyne 64E LiDAR from KITTI 43. The main framework comprises three modules. The first module is early fusion, which projects the RGB image into 3D space and augments the point cloud data with color texture to generate a 7D colored point cloud. The second module is the BEV encoding of the 7D colored point cloud, which unifies the 7D colored point cloud into 2D BEV grids and converts point sets into feature vectors of uniform dimension. In the third module, BEV maps are fed into the Feature Fusion (2F) network to generate proposals, and parameters such as semantic class, bounding box, and orientation are estimated from multiple layers of feature maps.
In this approach, the basic unit is the 2D grid, which not only reduces the dimensionality of the point cloud but also saves memory thanks to the smaller input size. Furthermore, the RPN in this model can utilize a deeper pyramid structure to capture rich features for improved performance.
Early fusion module. Through the calibration parameters, a 3D point $\mathbf{p}_{velo}$ in the LiDAR coordinate system is projected onto the image plane as

$$\mathbf{p}_{img} = T_{proj}\, R^{0}_{rect}\, T^{cam}_{velo}\, \mathbf{p}_{velo},$$

where $R^{0}_{rect}$ is the rotation matrix, $T^{cam}_{velo}$ is the transformation matrix from the LiDAR to the camera coordinate system, and $T_{proj}$ is the projection matrix from the camera coordinate system to the image plane.
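To make this projection step concrete, the minimal Python/NumPy sketch below projects LiDAR points onto the image plane with KITTI-style calibration matrices and attaches the RGB value of the corresponding pixel to each visible point. The function name colorize_point_cloud and the assumption that the calibration matrices have already been padded to homogeneous 4 × 4 / 3 × 4 form are illustrative, not part of the authors' released implementation.

import numpy as np

def colorize_point_cloud(points, image, T_velo_to_cam, R0_rect, P_proj):
    """Project LiDAR points onto the image and attach RGB values (7D points).

    points        : (N, 4) array of x, y, z, intensity in LiDAR coordinates
    image         : (H, W, 3) RGB image
    T_velo_to_cam : (4, 4) LiDAR-to-camera transformation matrix (homogeneous)
    R0_rect       : (4, 4) rectifying rotation matrix (homogeneous)
    P_proj        : (3, 4) camera projection matrix
    """
    xyz1 = np.hstack([points[:, :3], np.ones((points.shape[0], 1))])  # homogeneous coordinates
    cam = (P_proj @ R0_rect @ T_velo_to_cam @ xyz1.T).T               # project onto image plane
    cam[:, :2] /= cam[:, 2:3]                                         # perspective division

    h, w = image.shape[:2]
    u, v, depth = cam[:, 0], cam[:, 1], cam[:, 2]
    # keep only points that fall inside the image and lie in front of the camera
    valid = (depth > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)

    rgb = image[v[valid].astype(int), u[valid].astype(int)]           # per-point color lookup
    # 7D colored point cloud: x, y, z, intensity, R, G, B
    return np.hstack([points[valid], rgb.astype(np.float32)])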
In this way, image pixels are projected onto the corresponding point data in 3D space according to the projection matrix. The corresponding pixels (from the RGB camera) are then assigned to the 3D points to generate the 7D colored point cloud. Each obtained 7D colored point therefore contains not only the 3D coordinates and the reflection intensity, but also color and texture, and can be denoted as $P = (x, y, z, I, R, G, B)$. In order to achieve real-time availability and reduce unnecessary computation, the detection range is restricted to a fixed region in front of the vehicle, discarding the remaining points and pixels. An illustration of the 7D colored point cloud generation is given in Fig. 2. Figure 3 shows an example of the different data on the KITTI dataset. The first row shows the original RGB image, the second row shows the 3D point cloud within the image's field of view, and the third row shows the 7-dimensional colored point cloud. The image provides road environment information, and the 3D LiDAR data presents the objects scanned by the sensor and their surrounding environment. The colored point cloud enhances the semantic information of the 3D point cloud. Therefore, the 7D colored data constructed in this section not only retains the spatial characteristics of the point cloud, but also enriches the semantic characteristics of surface points, which reduces the dependence of point cloud feature extractors on object shape.

BEV encoding. This stage uses prior knowledge and spatial ensemble constraints to process the generated 7D colored point cloud and obtains rich and compact 6-channel BEV maps, which can be considered a pseudo image. There are two advantages. Firstly, the BEV grid allows each object to occupy an individual spatial position, which reflects the relative positional relationships between objects and reduces the disturbance of overlap and occlusion. Secondly, 6-channel BEV maps can be processed directly by conventional convolution structures, implying less computation and faster detection. The generated 7D colored point cloud is converted into 2D grids and then into a 6-channel BEV map according to Eq. (4):

$$M_i = (H_i, I_i, D_i, R_i, G_i, B_i),$$

where $H$ is the average height, $I$ is the average intensity, $D$ is the average density, and $R, G, B$ are the average primary colors, respectively. The conversion of the generated 7D colored point cloud (left) into the 6-channel colored BEV map (right) is illustrated in Fig. 4.
The specific conversion process is as follows: the obtained 7D colored point cloud is projected onto the x-y plane, and the detection area is divided into 2D grids with an interval of 0.1 m, so each grid covers an area of 0.1 m × 0.1 m.
Step 1: The first channel is the height map. The height feature is encoded as the maximum height within each grid normalized by the maximum height value $|H|_{max}$ over the detection region, and the obtained normalized height value is stored in the grid, i.e. Eq. (5):

$$H_i = \frac{\max_{j}\, z_{P_j}}{|H|_{max}},$$

where $z_{P_j}$ is the height of the $j$th point falling in the $i$th grid. Step 2: The second channel is the intensity map. The intensity feature is encoded as the average reflection intensity in each grid, Eq. (6):

$$I_i = \frac{1}{n_i}\sum_{j=1}^{n_i} I_{P_j},$$
where $I_i$ is the intensity value of the $i$th grid in the top view, $I_{P_j}$ is the intensity value of the $j$th point, and $n_i$ is the number of points falling in the $i$th grid.
Step 3: The third channel is the density map, which is encoded by the number of points within each grid. The density is normalized according to Eq. (7):

$$D_i = \frac{n_i}{n_{max}},$$

where $D_i$ is the density of the $i$th grid in the top view, $n_i$ is the number of 3D points falling in the $i$th grid, and $n_{max}$ is the number of points in the grid with the highest density among all cells.
Step 4: The fourth to sixth channels are the color features. The average value of each color component over the points in a grid is computed to obtain the average triplet $(R, G, B)$.
In summary, the 7D colored point cloud is encoded as a 6-channel BEV image represented by the height, intensity, density, and color derived from both modalities. On the one hand, it has a regular and structured format that can be easily processed; on the other hand, it is compact and does not require 3D convolution, saving computational resources.
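A minimal sketch of this six-channel encoding is given below, assuming the 0.1 m grid resolution stated above. The exact detection ranges, and the use of the per-cell maximum height before global normalization, are illustrative assumptions rather than the authors' exact implementation.

import numpy as np

def encode_bev(points7d, x_range=(0.0, 70.4), y_range=(-40.0, 40.0), res=0.1):
    """Encode a 7D colored point cloud into a 6-channel BEV pseudo image.

    points7d : (N, 7) array of x, y, z, intensity, R, G, B.
    Channels : normalized height, mean intensity, normalized density, mean R, G, B.
    The x/y ranges here are placeholders for the paper's detection region.
    """
    H = int((x_range[1] - x_range[0]) / res)
    W = int((y_range[1] - y_range[0]) / res)
    bev = np.zeros((6, H, W), dtype=np.float32)
    counts = np.zeros((H, W), dtype=np.float32)

    # grid index of each point (0.1 m x 0.1 m cells)
    ix = ((points7d[:, 0] - x_range[0]) / res).astype(int)
    iy = ((points7d[:, 1] - y_range[0]) / res).astype(int)
    keep = (ix >= 0) & (ix < H) & (iy >= 0) & (iy < W)
    pts, ix, iy = points7d[keep], ix[keep], iy[keep]

    for p, i, j in zip(pts, ix, iy):
        counts[i, j] += 1
        bev[0, i, j] = max(bev[0, i, j], p[2])   # max height per cell (zero-initialized)
        bev[1, i, j] += p[3]                     # accumulate intensity
        bev[3, i, j] += p[4]                     # accumulate R
        bev[4, i, j] += p[5]                     # accumulate G
        bev[5, i, j] += p[6]                     # accumulate B

    occ = counts > 0
    for c in (1, 3, 4, 5):                       # intensity and color become per-cell means
        bev[c][occ] /= counts[occ]
    bev[0] /= max(np.abs(bev[0]).max(), 1e-6)    # height normalized by |H|max
    bev[2] = counts / max(counts.max(), 1.0)     # density normalized by n_max
    return bev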
Feature fusion (2F) detection model. As illustrated in Fig. 5, the 2F model is basically an encoder-decoder framework that applies ResNet-50 44 in combination with a Feature Pyramid Network (FPN) structure 45 . The obtained colored BEV maps provide rich information and are utilized as input. To obtain accurate object location and semantics, the semantic texture of objects is captured by continuous down-sampling, and then high-level and low-level feature maps are combined to achieve multi-level feature fusion.
In the bottom-up path, feature pyramids are constructed using the feature maps of the C2, C3, and C4 stages of ResNet-50 at scales of 1/2, 1/4, and 1/8.
In the top-down path, after encoding, the feature maps from each layer are passed downward: the encoded feature maps are up-sampled several times according to the corresponding layers above to recover the input feature resolution, and are fused using 3 × 3 convolutions and element-wise average fusion operations. Feature layers with the same original size can be regarded as belonging to the same stage.
Because of position errors introduced by the repeated sampling operations, the bottom-up high-level features are combined with low-level detailed features, yielding BEV feature maps with strong semantics and high resolution. Finally, the multi-scale fused feature maps are obtained by concatenation and passed to two fully connected layers to predict the results.
Therefore, the generated feature pyramids are utilized to generate 2D proposals, with the multi-scale feature maps provided by the multi-scale feature aggregation module.
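The following PyTorch sketch illustrates one way such a 2F encoder-decoder could be assembled from ResNet-50 stages with lateral 1 × 1 convolutions, element-wise averaging, and 3 × 3 smoothing convolutions. Channel widths, the modified 6-channel stem, and the final concatenation are assumptions for illustration, not the authors' released code; the input height and width are assumed divisible by 8.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class FeatureFusion2F(nn.Module):
    """Sketch of a 2F-style encoder-decoder: ResNet-50 stages C2-C4 fused FPN-style."""

    def __init__(self, in_channels=6, out_channels=128):
        super().__init__()
        backbone = resnet50(weights=None)
        # replace the stem so the network accepts a 6-channel BEV map (stride 2, no max-pool)
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False),
            backbone.bn1, backbone.relu)
        self.c2 = backbone.layer1            # bottom-up stages at 1/2, 1/4, 1/8 scale
        self.c3 = backbone.layer2
        self.c4 = backbone.layer3
        # 1x1 lateral convs project each stage to a common width
        self.lat2 = nn.Conv2d(256, out_channels, 1)
        self.lat3 = nn.Conv2d(512, out_channels, 1)
        self.lat4 = nn.Conv2d(1024, out_channels, 1)
        # 3x3 convs smooth the fused maps, as described in the text
        self.smooth2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.smooth3 = nn.Conv2d(out_channels, out_channels, 3, padding=1)

    def forward(self, bev):
        c2 = self.c2(self.stem(bev))
        c3 = self.c3(c2)
        c4 = self.c4(c3)
        # top-down path: upsample and fuse with element-wise averaging
        p4 = self.lat4(c4)
        p3 = self.smooth3((self.lat3(c3) + F.interpolate(p4, scale_factor=2)) / 2)
        p2 = self.smooth2((self.lat2(c2) + F.interpolate(p3, scale_factor=2)) / 2)
        # concatenate multi-scale maps at the highest resolution for the detection head
        p3_up = F.interpolate(p3, scale_factor=2)
        p4_up = F.interpolate(p4, scale_factor=4)
        return torch.cat([p2, p3_up, p4_up], dim=1)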
Loss function. The loss function for vehicle detection is similar to that of PointPillars 10 and SECOND 17 . It consists of three components: a Smooth-L1 loss for position regression, a classification loss $L_{cls}$, and a direction (heading angle) loss $L_{dir}$.
The parameters of a detection box are defined by $(x, y, z, w, l, h, \theta)$, where $x, y, z$ are the center coordinates of the 3D box, $w, l, h$ are the width, length, and height, respectively, and $\theta$ is the heading angle (object orientation).

(a) Localization loss $L_{loc}$: following SECOND 17 , the regression residuals between the ground truth and the anchors are defined as

$$\Delta x = \frac{x^{g}-x^{a}}{d^{a}},\; \Delta y = \frac{y^{g}-y^{a}}{d^{a}},\; \Delta z = \frac{z^{g}-z^{a}}{h^{a}},\; \Delta w = \log\frac{w^{g}}{w^{a}},\; \Delta l = \log\frac{l^{g}}{l^{a}},\; \Delta h = \log\frac{h^{g}}{h^{a}},\; \Delta\theta = \theta^{g}-\theta^{a},$$

where the superscripts $g$ and $a$ denote the ground-truth and anchor values, and $d^{a}=\sqrt{(w^{a})^{2}+(l^{a})^{2}}$ is the diagonal of the anchor base; the Smooth-L1 loss is applied to these residuals.

(b) Classification loss $L_{cls}$: the focal loss is adopted,

$$L_{cls} = -\alpha (1-p)^{\gamma}\log p,$$

where $p$ is the classification probability of the predicted box, $\alpha$ is a weighting factor that balances the strength of positive and negative examples, and $\gamma$ is the focusing parameter; $\alpha$ and $\gamma$ are set to 0.25 and 2, respectively.

(c) Directional (heading angle) loss $L_{dir}$: since the angle has two directions $\{+,-\}$ and the angle regression loss cannot distinguish the orientation, a softmax function is used to compute a discretized orientation loss. If the heading angle of the ground truth around the Z-axis is greater than 0, the orientation is positive; otherwise, the orientation is negative.
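A hedged sketch of how these three terms could be combined in code is shown below. The tensor shapes, the clamp constant, and the exact focal-loss formulation follow common PointPillars/SECOND-style implementations rather than details stated in the paper; the weighted combination uses the weights given in the next paragraph.

import torch
import torch.nn.functional as F

def detection_loss(box_pred, box_target, cls_pred, cls_target,
                   dir_pred, dir_target, num_pos,
                   beta_loc=2.0, beta_cls=1.0, beta_dir=0.2,
                   alpha=0.25, gamma=2.0):
    """Sketch of the combined detection loss (tensor shapes are illustrative).

    box_pred, box_target : (N, 7) encoded residuals (x, y, z, w, l, h, theta)
    cls_pred             : (N,) predicted object probabilities in (0, 1)
    cls_target           : (N,) 1.0 for positive anchors, 0.0 for negatives
    dir_pred             : (N, 2) logits for the discretized {+, -} heading direction
    dir_target           : (N,) integer 0/1 direction labels
    """
    # (a) Smooth-L1 localization loss on the box residuals
    loc_loss = F.smooth_l1_loss(box_pred, box_target, reduction='sum')

    # (b) focal loss for classification with alpha = 0.25, gamma = 2
    p_t = cls_pred * cls_target + (1.0 - cls_pred) * (1.0 - cls_target)
    alpha_t = alpha * cls_target + (1.0 - alpha) * (1.0 - cls_target)
    cls_loss = (-alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-6))).sum()

    # (c) softmax cross-entropy on the binarized heading direction
    dir_loss = F.cross_entropy(dir_pred, dir_target, reduction='sum')

    # weighted combination normalized by the number of positive anchors
    return (beta_loc * loc_loss + beta_cls * cls_loss + beta_dir * dir_loss) / max(num_pos, 1)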
By combining the losses discussed above, the overall loss function can be formulated as

$$L = \frac{1}{N_{pos}}\left(\beta_{loc} L_{loc} + \beta_{cls} L_{cls} + \beta_{dir} L_{dir}\right),$$

where $N_{pos}$ is the number of correctly detected boxes, and $\beta_{loc}$, $\beta_{cls}$, and $\beta_{dir}$ are the weights of the regression, classification, and direction terms, set to 2.0, 1.0, and 0.2, respectively.

When the degree of coincidence between the predicted detection box and the object reaches the threshold, the prediction is considered correct. The nuScenes dataset is much bigger than the KITTI dataset and provides full annotations to support a variety of tasks (3D object detection, tracking, and BEV map segmentation). In this work, we utilize LiDAR point clouds and RGB images. Ten categories are evaluated: cars, trucks, buses, trailers, construction vehicles, pedestrians, motorcycles, bicycles, traffic cones, and barriers. Implementation details. To verify the performance of the 3D object detection method based on the generated 7D colored point cloud, the KITTI training set is split into two non-overlapping subsets for training (3712 samples) and validation (3769 samples).

Experiments
In the training stage, the whole model uses Adaptive Moment Estimation (Adam) as the optimizer to update the model parameters. The momentum ranges from 0.95 to 0.85, the initial learning rate is 0.001, the fixed weight-decay coefficient is 0.001, the batch size is 12, and the number of epochs is 300. The model is implemented with PyTorch 1.6 and runs on an NVIDIA 1080 Ti GPU. During model evaluation, only this method takes the 7D colored point cloud as input, and the others utilize the original 3D point cloud.
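As an illustration of this configuration, a minimal optimizer setup might look as follows. The one-cycle schedule is only an assumption about how the momentum is moved between 0.95 and 0.85, and model and steps_per_epoch are placeholders.

import torch
import torch.nn as nn

# Hedged sketch of the reported optimization setup.
model = nn.Conv2d(6, 16, 3)                      # placeholder for the 2F detection network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-3)
steps_per_epoch = 100                            # placeholder for len(train_loader)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, epochs=300, steps_per_epoch=steps_per_epoch,
    base_momentum=0.85, max_momentum=0.95)       # momentum cycled between 0.85 and 0.95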

Experimental comparison between 7D colored point cloud and 3D point cloud.
To verify the effectiveness of the proposed fusion method with different input combinations, Table 1 compares the detection results of various input data on the KITTI validation set. The network takes the 3D point cloud (first row) and the 7D colored point cloud (second row) as input, respectively. When using the 3D point cloud, the input channel count of the network is changed to 3.
The difficulty levels "Easy (E)", "Moderate (M)", and "Hard (H)" are defined by the KITTI official website. As can be observed for the three categories of vehicle, pedestrian, and cyclist, the detection results obtained when using both modalities as input are higher than when only using the 3D point cloud. For the pedestrian and cyclist categories, the accuracy improves significantly; the reason is that the color information of the image makes the input semantic information richer, and the multi-scale feature fusion network has certain advantages in detecting distant objects.
To visually compare the detection effects of the two different inputs, Fig. 6 presents some visualization results for occluded and distant objects on the KITTI validation set, from both the image view and the bird's-eye view. In each sub-block, the left column shows the detection result that takes only the 3D point cloud as input, and the right column shows the result using the 7D colored point cloud as input. Figure 6A shows distant object detection results, and Fig. 6B shows occluded object detection results. The red boxes represent vehicles, the blue boxes represent cyclists, and the yellow boxes represent pedestrians. The green part of each box indicates the orientation of the object.
It can be concluded that in a complex and crowded environment, missed detections occur when using only the sparse 3D point cloud, owing to the absence of color information. In addition, with many surrounding obstacles and other interfering factors, the image can compensate for the low resolution of LiDAR to a certain extent and make the detection more accurate. In conclusion, the detection network using multi-modal data is robust and alleviates problems frequently encountered in traffic scenarios.
In addition, to obtain an intuitive understanding of the detection performance, the prediction results of this method are compared with the ground truth, visualized for partially occluded and distant targets, respectively. On the validation set, the 3D bounding boxes produced by the algorithm are rendered on the RGB images and the corresponding point clouds in Fig. 7, where red represents the ground-truth 3D bounding box and green represents the 3D box predicted by the proposed method.
As shown in Fig. 7, the center point and the length, width, and height of the predicted 3D bounding boxes are very close to the ground truth. In summary, the detection network based on multi-modal data fusion is robust and can alleviate the missed and false detections of partially occluded and distant targets that often occur in traffic scenes. In Table 2, the detailed Average Precision (mAP) and inference time for 3D vehicle detection and 2D BEV detection are reported, using the easy, moderate, and hard difficulty levels defined by the KITTI official website. It can be seen from Table 2 that the proposed method achieves AP values of 89.14% and 77.85% at the easy and moderate levels respectively, with an average inference time of approximately 0.05 s, outperforming the results attained by most published methods. For the 2D BEV detection task, the results are approximately the same as for the 3D vehicle detection task.

Evaluation on KITTI object benchmark test dataset.
In comparison to LiDAR-based methods, the proposed method outperforms existing methods in the 2D BEV detection task but is slightly worse than PointPillars 10 at the hard level. Moreover, the performance of this method exceeds VoxelNet 16 by 2.39% and SECOND 17 by 3.67%, which demonstrates that this model performs better with RGB images and point clouds as input.
In comparison to multi-sensor fusion-based methods on the 3D detection task, it can be found from Table 2 that this method outperforms all previous methods except MVAF-Net 29 at the moderate and hard levels. The proposed method outperforms MV3D 25 by 14.17%, AVOD 26 by 12.75%, and ContFuse 38 by 6.6% at the easy level, respectively. Because ContFuse uses the 3D points to project features from the image to 3D space, "feature blurring" occurs: a feature vector in the BEV corresponds to multiple pixels in the image view. For the 2D BEV detection task, this method outperforms all the previous methods except MMF 30 at the moderate level.
The reason for the inferior performance is that MMF 30 and MVAF-Net 29 are two-stage methods based on bounding boxes. They adopt multi-view fusion, which not only utilizes different projected views of the point cloud but also fuses them in multiple stages. To be more specific, MMF 30 improves 3D detection performance through additional auxiliary tasks (including 2D object detection, ground estimation, and depth completion), which adds additional labeling effort. Although both methods have higher detection performance, they require an additional bounding box optimization process, a more complicated network structure, and a large amount of computation, which increases running time. Notably, the proposed method does not involve complex subsequent processing (such as IoU-based non-maximal suppression) to filter overlapping results. Table 2 also lists the running time of this method against the above-mentioned methods. It can be noticed that this method takes approximately 0.05 s for inference, faster than the multi-sensor fusion-based methods, and its average computing speed on the KITTI object detection dataset can reach 19.6 FPS. For a more intuitive analysis, Fig. 8 presents some predicted visualization results in the RGB view and the BEV view on the KITTI object detection test set. Pedestrians are indicated by yellow detection boxes, vehicles by red detection boxes, and cyclists by blue detection boxes, and the green boundary is the corresponding orientation. Each sub-block is composed of an RGB image and its corresponding LiDAR data. Note that the visualization of the prediction results is based on the bird's-eye view generated from the 7D colored point cloud and re-projected back onto the image for illustrative purposes only.
On the KITTI Odometry sequence 05 dataset, Fig. 9 shows some visualization results of the proposed method in the RGB images. Specifically, the visualization results for distant objects are on the left, and the occluded detection results are on the right. Pink represents 2D detections and yellow represents 3D detections. It can be observed that, under different illumination, the algorithm attains good detection results for vehicles. At the same time, the proposed method can still detect vehicles well in the presence of large traffic volume and partial occlusion. The quantitative results on the KITTI Odometry dataset, including 3D vehicle detection and 2D detection, are shown in Table 3.
Evaluation on nuScenes dataset. We further compare this method with some related methods on the nuScenes dataset, which is a large-scale outdoor dataset. As shown in Table 4, where the best results are in bold, this method obtains a large increase in mAP compared to PointPillars 10 . Some categories such as traffic cones, pedestrians, and bicycles often have few LiDAR points on them, so the additional appearance features provided by color and texture information are extremely valuable. The reason that our method has slightly lower performance than TransFusion 33 is that TransFusion utilizes a soft-association mechanism between point clouds and image pixels, and its attention mechanism adaptively determines where and what information should be taken from the image. Some visualization results are shown in Fig. 10, where yellow represents cars, orange represents trucks, and blue represents pedestrians.
Owing to the sparseness of 3D data, the number of scanned points on distant objects is very small, so there is not enough data from which to extract features for recognition; this method adds color features to the raw 3D point cloud to enrich the information available for such objects. Under conditions of low visibility and light sensitivity, image features cannot be extracted, which will produce undesirable results; since this method is essentially computed on the point cloud, it can still extract information in such cases. There are some issues to be considered in the future. First, this paper considers two modalities; when adding a third sensor, the data distributions from different viewpoints may differ, and accurate calibration is required in advance. Second, the network relies on updates of the 2D detector, and results are limited by existing 2D detection networks. Finally, only the early fusion strategy has been taken into consideration, and combinations of various fusion strategies should be considered in the future.

Conclusion
This paper proposes a 3D vehicle detection network based on image and point cloud, which introduces an early fusion module, a BEV encoding format, and a Feature Fusion (2F) network. Unlike other fusion-based methods, image pixels are projected, through the calibration parameters, onto the corresponding LiDAR data in 3D space to generate 7D colored point clouds, and the obtained 7D colored point clouds are then converted to a top view for feature extraction. The proposed method can be trained end-to-end. Experimental results demonstrate that it achieves better detection performance for occluded and distant vehicles.

Data availability
The datasets generated during the current study are available from the corresponding author on reasonable request.